Abstract

We apply the spike-and-slab Restricted 
Boltzmann Machine (ssRBM) to texture 
modeling. The ssRBM with tiled-convolution 
weight sharing (TssRBM) achieves or sur- 
passes the state-of-the-art on texture synthe- 
sis and inpainting by parametric models. We 
also develop a novel RBM model with a spike- 
and-slab visible layer and binary variables in 
the hidden layer. This model is designed to 
be stacked on top of the TssRBM. We show 
the resulting deep belief network (DBN) is a 
powerful generative model that improves on 
single-layer models and is capable of model- 
ing not only single high-resolution and chal- 
lenging textures but also multiple textures. 



1 Introduction 

Texture processing is one of the essential components 
of scene understanding in human vision. Natural im- 
ages can be seen as a large mixture of heterogeneous 
textures. Thus, to a certain extent, progress in model- 
ing natural images requires that we make progress in 
modeling textures. To this end, texture modeling has 
been an active research area of machine learning, com- 
puter vision and graphics during the past five decades. 



Although nonparametric approaches (Lin et al., 2006) have made significant progress in synthesizing textures from example images, capturing the statistical properties of textures via a probabilistic model remains an active area of inquiry. Such probabilistic models are important not only for modeling natural images (Heess et al., 2009) but also for understanding human vision (Zhu et al., 2000).



In this work we consider a probabilistic model of textures based on the spike-and-slab Restricted Boltzmann Machine (ssRBM) (Courville et al., 2011a,b). The ssRBM has previously demonstrated the ability to generate samples of small natural images that preserved much of their statistical structure (Courville et al., 2011b). This suggests that the ssRBM is potentially well suited to the task of texture modeling. Following the recent exploration of Boltzmann machines for textures (Kivinen and Williams, 2012), we have trained ssRBMs with tiled-convolution weight sharing (Gregor and LeCun, 2010; Le et al., 2010) on the Brodatz texture images.¹ Tiled convolution allows weight sharing in filters with non-overlapping receptive fields. The use of tiled convolution is a particularly appropriate choice of model architecture in the context of texture modeling. The weight sharing allows the model to synthesize texture patches of variable size, while the tiling pattern of weight sharing allows us to efficiently devote model capacity to modeling the local texture patches.



In Kivinen and Williams (2012), the authors concentrated their quantitative evaluation of the texture models on a subset of the Brodatz textures that exhibit strong spatial invariance, i.e. textures largely consisting of a regular repeating pattern. While this is an important problem in its own right, most natural textures (i.e. those associated with a natural-looking world) exhibit significant spatial non-stationarity and features with a wide spatial frequency range. One popular way to deal with images with a wide spatial frequency range is to decompose the frequencies using, for example, a Laplacian image pyramid. However, since many textures have features that interact across spatial resolutions, spatial pyramids would appear to be inappropriate. We propose that deep convolutional generative architectures are well suited to model these natural textures. In particular, by increasing the effective receptive field with depth, we can use higher layers of the model to efficiently communicate information such as phase to spatially isolated parts of the first-layer model.

¹ www.ux.uis.no/~tranden/brodatz.html

Deep belief networks also have another important property that we find useful in the context of texture modeling. As argued by Hinton (2012) and Le Roux and Bengio (2008), training the lower layers by contrastive divergence (CD) (Hinton et al., 2006) allows the lower layers to concentrate on modeling local features of the data. We have found the best results by training the lower layers by CD and the uppermost layer by a closer approximation to maximum likelihood, such as persistent contrastive divergence (PCD) (Tieleman, 2008), promoting a better division of labour between the layers of the DBN. On this account, the ssRBM offers an important advantage over other similar models in the literature: unlike models such as the mcRBM (Ranzato and Hinton, 2010) and mPoT (Ranzato et al., 2010a), the structure of the ssRBM makes it readily amenable to CD training.

Our contributions are, first, the exploration of a tiled-convolutionally trained ssRBM (TssRBM) texture model and its objective comparison with other similar models in the literature. We show that the TssRBM is competitive with the state of the art on texture synthesis and inpainting tasks on a selection of Brodatz textures. Second, we develop a novel RBM model with a spike-and-slab visible layer and binary variables in the hidden layer. This model is designed to be stacked on top of the TssRBM within a deep belief network configuration, with each layer trained convolutionally with a greedy layer-wise pretraining strategy. We demonstrate how the resulting two- and three-layer DBN models (the third layer is a standard RBM) are able to encode longer-range dependencies in the higher layers while simultaneously recovering more detailed structure in the CD-trained lower layer, all of which translates to superior texture model performance, particularly when the textures being modeled exhibit strong non-stationarity. Finally, we show how depth helps in learning a generative model of multiple textures. Kivinen and Williams (2012) introduce a model capable of modeling multiple textures; however, they make use of label information in the training process, alleviating the difficult learning problem of constructing multiple modes to represent each texture. In this work, we show how a deep belief network based on the ssRBM is capable of learning to model multiple textures based on purely unsupervised training.

2 Previous Work 

The problem of texture synthesis has been extensively studied in the computer vision community for decades (Zhu et al., 2000). Probably the most popular texture synthesis strategies are currently example-based or nonparametric methods (Wei et al., 2009). These typically seed a target image with transformed versions of patches drawn from the target texture. While these methods are flexible, they are unlikely to be readily applicable to natural textures, where some aspects of the statistical structure (e.g. the path of the duck tracks) are global in scope.

The Gaussian RBM (Welling et al., 2005; Ranzato and Hinton, 2010) models real-valued observations by adding quadratic terms on the visible units to the standard binary-binary RBM energy function. One limitation of the Gaussian RBM is that changing its hidden unit activations only changes the conditional mean of the visible units. For modeling natural images, it has been found important to allow the hidden unit configuration to capture changes in the covariance between pixels, and this has motivated several of the models discussed below as well as the ssRBM. The product of Student's t-distributions (PoT) model (Welling et al., 2003) is an energy-based model where the conditional distribution over the visible units given the hidden variables is a multivariate Gaussian (with non-diagonal covariance) and the complementary conditional distribution over the hidden variables given the visibles is a set of independent Gamma distributions. The PoT model has recently been generalized to the mPoT model (Ranzato et al., 2010b) to include nonzero Gaussian means by the addition of Gaussian RBM-like hidden units. Kivinen and Williams (2012) also explored the "Multi-Texture Boltzmann Machine" (Multi-Tm), training a single large Gaussian RBM (up to 256 feature maps per tiling position, as opposed to 32 maps per tiling position) on multiple textures. In modeling multiple textures, Kivinen and Williams (2012) used label information during the training process to enable the model to focus on a single texture class at a time. In Section 5.3, we show how we can use a deep belief network, based on the ssRBM, to learn a model of multiple textures using no label information at all.

In addition to validating the ssRBM as the basis of an effective texture model, we also set out to study the impact of adding layers to the tiled-convolutional ssRBM model, in order to see if depth can help maintain coherence of large-scale texture features. Recent work (Ranzato et al., 2011) has shown that stacking additional RBM layers on top of an mPoT model (also trained using tiled-convolutional weight sharing) can have a dramatic impact on the ability of the model to generate globally coherent natural image samples. Findings such as these motivated our attempt to use depth to synthesize textures with increased global coherence.



3 Spike-and-Slab RBM 

The ssRBM describes the interaction between three sets of random variables: the real-valued visible random vector $v \in \mathbb{R}^D$ representing the observed data of dimension $D$, the set of binary "spike" random variables $h \in \{0,1\}^N$ and the real-valued "slab" random variables $s \in \mathbb{R}^N$. The ssRBM has the interpretation that, with $N$ hidden units, the $i$th hidden unit is associated with both an element $h_i$ of the binary vector and an element $s_i$ of the real-valued variable. In this work we will concern ourselves with the ssRBM formulation referred to as the $\mu$-ssRBM (Courville et al., 2011b), with the associated energy function:

$$E(v,s,h) = -\sum_{i=1}^{N} v^T W_i s_i h_i + \frac{1}{2} v^T \Big( \Lambda + \sum_{i=1}^{N} \Phi_i h_i \Big) v + \frac{1}{2} \sum_{i=1}^{N} \alpha_i s_i^2 - \sum_{i=1}^{N} \alpha_i \mu_i s_i h_i - \sum_{i=1}^{N} b_i h_i + \frac{1}{2} \sum_{i=1}^{N} \alpha_i \mu_i^2 h_i, \qquad (1)$$

where $W_i \in \mathbb{R}^D$ denotes the $i$th weight (or feature) vector, $b_i$ is a scalar bias associated with the spike variable $h_i$, $\mu_i$ and $\alpha_i$ are respectively a mean and precision parameter associated with the random slab variable $s_i$, $\Lambda$ is a diagonal precision matrix on the visibles and $\Phi_i$ is an $h_i$-gated contribution to the precision on $v$. As is standard with energy-based models, the joint probability distribution over $v$, $s = [s_1, \ldots, s_N]$ and $h = [h_1, \ldots, h_N]$ is specified as $p(v,s,h) = \frac{1}{Z} \exp\{-E(v,s,h)\}$, where $Z$ is the normalizing partition function.

An interesting property of the ssRBM is that, despite having higher-order interactions of variables, the model maintains the bipartite graph structure of the standard restricted Boltzmann machine, where the $i$th hidden unit consists of the product of the random variables $s_i$ and $h_i$. This property implies that, unlike the mPoT (which also models conditional variance), the ssRBM shares the simple and practical conditional independence structure of the standard restricted Boltzmann machine. This makes it easy to use efficient block Gibbs sampling, as seen in the model conditionals:

$$p(v \mid s,h) = \mathcal{N}\Big( C_{v|s,h} \sum_{i=1}^{N} W_i s_i h_i,\; C_{v|s,h} \Big), \qquad (2)$$

$$P(h \mid v) = \prod_{i=1}^{N} P(h_i \mid v), \qquad P(h_i = 1 \mid v) = \sigma\Big( \frac{1}{2} \alpha_i^{-1} (v^T W_i)^2 + v^T W_i \mu_i - \frac{1}{2} v^T \Phi_i v + b_i \Big), \qquad (3, 4)$$

$$p(s \mid v,h) = \prod_{i=1}^{N} \mathcal{N}\Big( \big( \alpha_i^{-1} v^T W_i + \mu_i \big) h_i,\; \alpha_i^{-1} \Big), \qquad (5)$$

where $\mathcal{N}(\mu, C)$ denotes a Gaussian distribution with mean $\mu$ and covariance $C$, $\sigma$ represents the logistic sigmoid, and $C_{v|s,h} = \big( \Lambda + \sum_{i=1}^{N} \Phi_i h_i \big)^{-1}$ is the diagonal conditional covariance matrix.
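The three conditionals above translate directly into one sweep of block Gibbs sampling. The following NumPy sketch is illustrative only and is not the authors' implementation: it assumes a single unbatched visible vector, that $\Lambda$ is stored as a length-$D$ vector `lam`, and that each $\Phi_i$ is diagonal and stored as column `i` of a `(D, N)` array `phi`. All function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, mu, alpha, phi, b, rng):
    """Spike conditional, Eq. (3)-(4). W and phi have shape (D, N)."""
    vW = v @ W                                   # v^T W_i for every i, shape (N,)
    act = 0.5 * vW**2 / alpha + vW * mu - 0.5 * (v**2) @ phi + b
    p = sigmoid(act)
    h = (rng.random(p.shape) < p).astype(float)
    return h, p

def sample_s_given_vh(v, h, W, mu, alpha, rng):
    """Slab conditional, Eq. (5): Gaussian with variance 1 / alpha_i."""
    mean = ((v @ W) / alpha + mu) * h
    s = mean + rng.standard_normal(mean.shape) / np.sqrt(alpha)
    return s, mean

def sample_v_given_sh(s, h, W, lam, phi, rng):
    """Visible conditional, Eq. (2), with a diagonal covariance."""
    cov = 1.0 / (lam + phi @ h)                  # diagonal of C_{v|s,h}, shape (D,)
    mean = cov * (W @ (s * h))
    v = mean + np.sqrt(cov) * rng.standard_normal(mean.shape)
    return v, mean

def gibbs_step(v, params, rng):
    """One block Gibbs sweep: h | v, then s | v,h, then v | s,h."""
    W, mu, alpha, lam, phi, b = params
    h, _ = sample_h_given_v(v, W, mu, alpha, phi, b, rng)
    s, _ = sample_s_given_vh(v, h, W, mu, alpha, rng)
    v_new, _ = sample_v_given_sh(s, h, W, lam, phi, rng)
    return v_new, s, h
```

Because each conditional factorizes over units, every step is a single vectorized operation, which is the practical benefit of the bipartite structure noted above.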

Training the ssRBM: Like the standard RBM, learning and inference in the ssRBM are rooted in the ability to efficiently draw samples from the model via block Gibbs sampling. In training the ssRBM we are free to use either contrastive divergence or a better approximation to maximum likelihood such as the stochastic maximum likelihood algorithm, also known as persistent contrastive divergence (PCD) (Tieleman, 2008). CD training involves approximating the negative phase component of the likelihood gradient by a few steps (often just one) of Gibbs sampling away from the data presented in the positive phase. In PCD, one maintains a persistent Markov chain to approximate the negative phase and simulates a few Gibbs steps between each parameter update. These samples are then used to approximate the expectations over the model distribution $p(v,s,h)$. Details regarding PCD training of the ssRBM are available in Courville et al. (2011a).
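The difference between CD and PCD lies only in where the negative-phase Gibbs chain starts. The sketch below makes that explicit under simplifying assumptions: `gibbs_step` is a hypothetical callable that maps a batch of visible configurations to the next state of the chain (for the ssRBM it would wrap one sweep over $h$, $s$ and $v$), and the function name and signature are ours rather than the paper's.

```python
import numpy as np

def negative_samples(v_data, persistent_chain, gibbs_step, k=1, use_pcd=False):
    """Return (negative-phase samples, updated persistent chain).

    CD:  the chain is restarted at the training data for every update.
    PCD: the chain continues from where the previous update left it.
    """
    start = persistent_chain if use_pcd else v_data
    v = np.array(start, dtype=float, copy=True)
    for _ in range(k):
        v = gibbs_step(v)
    # Only PCD carries the chain state over to the next parameter update.
    new_chain = v if use_pcd else persistent_chain
    return v, new_chain
```

With `use_pcd=False` and `k=1` this corresponds to one-step CD; with `use_pcd=True` the returned chain is fed back in at the next update.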



Our use of block Gibbs sampling marks an important distinction between our approach to learning and that used by Kivinen and Williams (2012), who use Hybrid Monte Carlo (HMC) (Neal, 1994) to draw samples from the model distribution. Their use of HMC is likely motivated by their need to train models such as the PoT model, where the conditional over the visible vector does not factorize and hence is not amenable to efficient block Gibbs sampling. The ability to easily and efficiently Gibbs sample from the ssRBM also makes it amenable to CD training, unlike models such as the PoT and mPoT models. As we show in the experiments with deeper models, the use of CD training is crucial to achieving our best results.



Tiled-Convolutional ssRBM: Gregor and LeCun (2010) introduced tiled-convolutional weight sharing (see also Ranzato et al., 2010a; Le et al., 2010), which is similar to convolutional weight sharing (LeCun et al., 1998; Desjardins and Bengio, 2008; Lee et al., 2009) except that spatially neighboring features (with overlapping receptive fields) do not share weights. Within the tiled-convolutional structure, every specific filter tiles the input image without overlapping itself, and at the same time different filters do overlap with each other. The first property of tiled convolution not only allows us to work efficiently on much bigger images than traditional convolutional models but also makes the states of hidden units less correlated, which is very helpful when we draw samples from the models by block Gibbs sampling. The second property aims at removing the tiling artifacts introduced by the non-overlapping filters.

To make comparisons easier, our TssRBM uses the same architecture as Kivinen and Williams (2012), including the same receptive field size of 11 x 11 and the same diagonal tiling pattern with a stride of one pixel (neighboring receptive fields are offset by one pixel). This diagonal tiling (which considerably reduces the number of free parameters) makes for 11 sets of filters (one for each offset). We also kept the number of filters constant (32 per set) to make comparisons with the results in Kivinen and Williams (2012) simpler.



4 An ssRBM-based Deep Belief 
Network 

In this section, we describe how we extend the TssRBM into a hierarchical generative model in the form of a deep belief network (DBN). Following the standard procedure for learning DBNs, we use a layer-wise training strategy. Training the bottom-layer ssRBM, whether by CD, PCD or FPCD, is straightforward and is discussed above in Sec. 3. We now consider the form of the model we intend to stack on top of the ssRBM.

Following the DBN approach, we express the ssRBM model as

$$P_{\mathrm{ssRBM}}(v) = \sum_{s,h} P(v \mid s,h)\, P(s,h).$$

As discussed in the previous section, due to the factorial nature of $P(v \mid s,h)$, it is convenient to consider this the bottom layer of our DBN and focus on how to model the spike-and-slab latent state. Let $P^0(v)$ denote the data distribution. We introduce another higher-layer model of the spike-and-slab state, $Q(s,h)$, to model the aggregated posterior distribution, $Q^0(s,h)$, of the ssRBM:

$$Q^0(s,h) = \sum_{v} P(s,h \mid v)\, P^0(v).$$

If $Q(s,h)$ models the aggregated posterior $Q^0(s,h)$ better than does $P(s,h)$ (defined by the ssRBM), then adding the second layer can improve the model of the training data (Hinton et al., 2006).

Formally, the two-layer model is

$$P_{\text{2-layer}}(v) = \sum_{s,h} P(v \mid s,h)\, Q(s,h).$$

From a generative perspective, the sampling procedure consists of generating a sample pair $(s,h)$ from the top (here, second) layer, followed by mapping it to image space through $P(v \mid s,h)$.

We have yet to specify the form of the model $Q(s,h)$. We will follow the common practice of using another model of the RBM family to model the distribution over $s$ and $h$. We introduce a variant of the RBM, $P(s,h,g)$, which models the aggregated posterior $Q^0(s,h)$ through a hidden binary random vector $g \in \{0,1\}^M$. We choose to use a binary hidden layer in order to transition to a more standard binary representation. When we add a third layer to the DBN, that layer is formed by training a standard binary-binary RBM.

The energy function of the second-layer model is defined as follows:

$$E(s,h,g) = -\sum_{i=1}^{N} \sum_{j=1}^{M} g_j U_{ij} s_i h_i - \sum_{j=1}^{M} \rho_j g_j + \frac{1}{2} \sum_{i=1}^{N} \alpha_i s_i^2 - \sum_{i=1}^{N} \alpha_i \mu_i s_i h_i - \sum_{i=1}^{N} b_i h_i,$$

where $U_{ij}$ refers to the $ij$th element of the weight matrix encoding the interactions between $g_j$ and the spike-and-slab variables $h_i$ and $s_i$. The term $\rho_j$ controls the bias on the binary $g_j$. All other parameters have the same interpretation as their first-layer analogues.

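Similar to the standard ssRBM, the conditionals $P(h \mid g)$, $p(s \mid h,g)$ and $P(g \mid s,h)$ are factorial. The conditionals over $s$ and $g$ are given by:

$$p(s \mid g,h) = \prod_{i=1}^{N} \mathcal{N}\Big( \Big( \alpha_i^{-1} \sum_{j=1}^{M} g_j U_{ij} + \mu_i \Big) h_i,\; \alpha_i^{-1} \Big),$$

$$P(g_j = 1 \mid s,h) = \sigma\Big( \sum_{i=1}^{N} U_{ij} s_i h_i + \rho_j \Big).$$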
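As a rough illustration of how these two second-layer conditionals could be sampled, the NumPy sketch below treats the weights as a dense `(N, M)` matrix `U` and ignores the convolutional weight sharing used in the actual model. The function names, argument order and dense layout are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_g_given_sh(s, h, U, rho, rng):
    """P(g_j = 1 | s, h) = sigmoid(sum_i U_ij s_i h_i + rho_j)."""
    p = sigmoid((s * h) @ U + rho)               # shape (M,)
    g = (rng.random(p.shape) < p).astype(float)
    return g, p

def sample_s_given_gh(g, h, U, mu, alpha, rng):
    """p(s_i | g, h): Gaussian with mean (sum_j g_j U_ij / alpha_i + mu_i) h_i."""
    mean = ((U @ g) / alpha + mu) * h            # shape (N,)
    s = mean + rng.standard_normal(mean.shape) / np.sqrt(alpha)
    return s, mean
```

The second return value of each function is the conditional probability or mean, which is what gets passed between layers when an expectation is used in place of a sample.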



The structure of this model gives us two advantages. First, at the start of training the second layer we can make $Q(s,h)$ close to the $P(s,h)$ defined by the first-layer ssRBM by initializing the corresponding parameters to match the values of their first-layer analogues. Second, after training the second layer, we obtain a new binary representation of the training data, on top of which building an even deeper model is straightforward. In our experiments, this architecture works very well.

Training the second-layer model: After pretraining the first layer (ssRBM), given training data $v$ we sample $h$ from $p(h \mid v)$, then take $\mathbb{E}_{p(s \mid v,h)}[s]$ and $p(h \mid v)$ as the new training data to train the second layer. Just as we do for the bottom-layer ssRBM, we train this second-layer RBM with either PCD or CD. We typically see the best results if we train with PCD for the top-layer model and with CD for all other layers.




Figure 1: The architecture of the lowest two layers. The first layer possesses tiled-convolutional weight sharing in a diagonal arrangement (tilings are represented by different colors). Each second-layer unit has a 2 x 2 receptive field over all the feature maps in the first hidden layer. The second layer is arranged with traditional convolutional weight sharing and a stride of 1.

Sampling and inference in our two-layer model: Once the second layer has been trained with PCD, it can be used to generate samples. We run Gibbs sampling in the top layer, obtaining a sample $g$. Next, we sample $h$ from $P(h \mid g)$, then pass $P(h \mid g)$ and $\mathbb{E}_{p(s \mid g,h)}[s]$ to the first layer. Inference in our two-layer model is exactly the same as the process of converting training data into the new representation (spike and slab variables) discussed above. Given $v$, we sample $h$ from $p(h \mid v)$, then pass $\mathbb{E}_{p(s \mid v,h)}[s]$ and $p(h \mid v)$ to the higher layer.



Convolutional Structure: The second layer possesses a convolutional weight sharing structure (not tiled-convolutional). Based on our use of patches of size 98 x 98 randomly cropped from the texture images, the tiling structure of the first-layer model results in a set of 32 x 11 feature maps of size 8 x 8 (the receptive field size was 11 x 11). Second-layer hidden units are each connected to all 32 x 11 feature maps with the same 2 x 2 receptive field across all feature maps. Using a stride length of 1, this implies that each second-layer feature is associated with a feature map of size 7 x 7. For our experiments with a 3-layer model, we keep the same convolutional weight sharing structure for the third layer and use receptive fields of size 2 x 2.
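The feature-map sizes quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the 11 filter sets sit at diagonal offsets (k, k) for k = 0, ..., 10, that each set tiles the patch with a step equal to the receptive field size, and that the third layer also uses a stride of 1; the 6 x 6 third-layer size is our own derivation under that last assumption and is not stated in the text.

```python
def tiled_positions(image_size=98, field=11, offset=0):
    """Top-left coordinates, along one axis, of one filter set's non-overlapping tiles."""
    return list(range(offset, image_size - field + 1, field))

def valid_conv_size(in_size, field, stride=1):
    """Output size along one axis of a convolution without padding."""
    return (in_size - field) // stride + 1

# First layer: every one of the 11 diagonal offsets yields 8 tiles per axis.
first_layer_sizes = {k: len(tiled_positions(offset=k)) for k in range(11)}

# Second and third layers: 2 x 2 receptive fields, stride 1.
second_layer_size = valid_conv_size(8, 2)
third_layer_size = valid_conv_size(second_layer_size, 2)
```

Under these assumptions each offset gives an 8 x 8 map, the second layer gives 7 x 7, and a third layer with the same structure would give 6 x 6.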

5 Experiments 

We evaluate our texture models on 8 texture images (D4, D6, D16, D21, D53, D68, D77 and D103) from the Brodatz texture dataset. According to Lin et al. (2006), we can roughly classify them into 4 different types: regular textures (D6, D21, D53, D77), near-regular textures (D16, D103), irregular textures (D68) and stochastic textures (D4). The regular textures are simpler: shallow models (such as mPoT, Gaussian RBM and ssRBM with tiled-convolutional weight sharing) are able to model them with high fidelity. However, the other textures (D4, D16, D68 and D103) remain challenging for shallow models. We show that deep models give better results.



5.1 TssRBM texture modeling 

In this section, we compare the tiled-convolutional ssRBM with other related models in the literature. We base our comparison on the results reported in Kivinen and Williams (2012). To provide a fair comparison, we follow the general experimental protocol established by Heess et al. (2009) and Kivinen and Williams (2012).

Specifically, we rescaled the original 640 x 640 textures (all but D16) to either 480 x 480 (D4, D21 and D77) or 320 x 320 (D6, D53, D68 and D103). Each texture image was divided into a top half for training and a bottom half for testing. We then report the performance of the TssRBM and our 2-layer TssRBM-based DBN on two tasks: texture synthesis and inpainting. All models, in all experiments, are trained on 98 x 98 patches randomly cropped from the preprocessed training texture images, which are normalized to have zero mean and a standard deviation of 1. We use a minibatch size of 64.



The TssRBM is trained with FPCD (Tieleman and Hinton, 2009). For deep models, we always pretrain the lower layers with one-step CD and train the top layer with PCD (we find that in the higher-layer RBMs the mixing of the negative-phase Gibbs chain is relatively fast, so we use PCD). In both PCD and FPCD training, the persistent chains are initialized with noise at the beginning of learning, and for some textures (especially the regular textures) restarting the Markov chains with a small probability, such as 0.01, seems advantageous to further promote mixing. After training, we apply our models to the following two tasks: texture synthesis and inpainting.

Texture Synthesis: For this task we generate unconstrained samples from our models by the usual DBN generative procedure, with Gibbs sampling in the top-level RBM, followed (in the case of deep models) by stochastic projection (except for the visible units, as usual, and except for the slab units, where we take the expectation) into image space. Following Kivinen and Williams (2012), after a large number of "burn-in" samples, we collected 128 samples of size 120 x 120 for both the 1-layer and 2-layer models. A quantitative measure of the quality of the samples is provided by the Texture Similarity Score (TSS) (Heess et al., 2009), comparing each generated sample with the test patches from the test region of the image. For a sample $s$ and test texture $x$, the TSS is given by the maximum normalized cross correlation (NCC):

$$\mathrm{TSS}(s, x) = \max_{1 \le i \le L} \frac{x_{(i)}^T s}{\lVert x_{(i)} \rVert \, \lVert s \rVert}, \qquad (8)$$

where $x_{(i)}$ denotes patch $i$ within the test region of the image and $L$ is the number of possible unique patches



Figure 2: Examples of texture synthesis for the models under consideration (rows) for different textures (columns). The 
top row has original data. 




Figure 3: Examples of texture inpainting for the models under consideration (rows) for different textures (columns). 



Synthesis          D6                 D21                D53                D77
Bi-FoE             0.7573 ± 0.0594    0.8710 ± 0.0317    0.8266 ± 0.0869    0.6464 ± 0.0215
TmPoT              0.9329 ± 0.0356    0.8961 ± 0.0696    0.8527 ± 0.0559    0.8699 ± 0.0080
TPoT               0.5641 ± 0.0916    0.7388 ± 0.1055    0.7583 ± 0.1082    0.6870 ± 0.0973
T-GaussRBM         0.9301 ± 0.0207    0.8901 ± 0.0792    0.8485 ± 0.0606    0.8663 ± 0.0084
Multi-Tm (256)     0.9304 ± 0.0280    0.9346 ± 0.0205    0.9231 ± 0.0103    0.8610 ± 0.0096
TssRBM             0.9365 ± 0.0468    0.9482 ± 0.0249    0.9412 ± 0.0215    0.8410 ± 0.0121
Our 2-layer DBN    0.9516 ± 0.0164    0.9465 ± 0.0322    0.9499 ± 0.0264    0.8638 ± 0.0161

Inpainting         D6                 D21                D53                D77
Efros&Leung        0.8524 ± 0.0318    0.8566 ± 0.0344    0.8558 ± 0.0578    0.6012 ± 0.0760
TmPoT              0.8629 ± 0.0180    0.8741 ± 0.0116    0.8602 ± 0.0234    0.7668 ± 0.0322
TPoT               0.8446 ± 0.0172    0.8609 ± 0.0275    0.8935 ± 0.0159    0.6379 ± 0.0373
T-GaussRBM         0.8578 ± 0.0160    0.8662 ± 0.0185    0.8494 ± 0.0233    0.7642 ± 0.0267
Multi-Tm (256)     0.8452 ± 0.0173    0.8673 ± 0.0103    0.8554 ± 0.0284    0.7328 ± 0.0615
TssRBM             0.8881 ± 0.0227    0.9119 ± 0.0139    0.9156 ± 0.0237    0.7627 ± 0.0314
Our 2-layer DBN    0.8894 ± 0.0246    0.9060 ± 0.0160    0.9242 ± 0.0285    0.7738 ± 0.0232

Table 1: A comparison of the one- and two-layer TssRBM results with other models. All reported results other than the TssRBM-based results were taken from Kivinen and Williams (2012) (including their Multi-Tm: a multiple-texture model trained with 256 hidden units). The synthesis results are based on the TSS criterion, while the inpainting results are based on MSSIM scores. In both cases larger numbers are better.




Figure 4: LEFT: Synthesized textures D53, D4 and D103 at full resolution. The training algorithms are shown in layer order, e.g. 3-DBN: CD-CD-PCD denotes a 3-layer DBN trained with CD for the first two layers and with PCD for the uppermost layer. Both depth and the choice of inductive bias have a significant impact on the quality of the model. RIGHT: The autocorrelation spectrum of Markov chain Monte Carlo samples of the texture D103 for our one-, two- and three-layer models. All layers are trained with CD, except the uppermost, which is trained with PCD (the TssRBM is trained with FPCD).



in the test region. A patch (and sample) of size 19 x 19 was used to compute the score. We only use TSS for the regular textures (D6, D21, D53, D77). Fig. 2 compares images of textures synthesized by some of the methods under consideration. Table 1 provides a quantitative comparison based on the TSS and shows that the TssRBM-based models are competitive with these other probabilistic models of texture.
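For concreteness, the TSS of Eq. (8) can be computed by brute force as in the following NumPy sketch. It follows the formula as written (no mean subtraction before normalization) and slides over every same-sized window of the test region; whether the original evaluation applied additional preprocessing is not stated here, so treat this as an assumption-laden illustration with hypothetical names.

```python
import numpy as np

def tss(sample, test_region):
    """Maximum normalized cross correlation between `sample` and every
    same-sized patch of `test_region` (both 2-D arrays)."""
    ph, pw = sample.shape
    H, W = test_region.shape
    s = sample.ravel().astype(float)
    s_norm = np.linalg.norm(s)
    best = -np.inf
    for r in range(H - ph + 1):
        for c in range(W - pw + 1):
            x = test_region[r:r + ph, c:c + pw].ravel().astype(float)
            denom = np.linalg.norm(x) * s_norm
            if denom > 0:                        # skip all-zero patches
                best = max(best, float(x @ s) / denom)
    return best
```

A sample that exactly matches some patch of the test region attains the maximum score of 1. In the experiments above the patch and sample size is 19 x 19.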

Inpainting: The inpainting (constrained texture synthesis) task requires the models to generate a texture which is consistent with a given boundary. Following Kivinen and Williams (2012), we randomly cut 76 x 76 texture patches from the test texture images and set the center (54 x 54) to zeros. The resulting images were fed to our models as the inpainting frames. The inpainting was done by running 500 Gibbs sampling iterations in our models while the border was held fixed. The number of inpainting frames was 20 for each texture, and each inpainting was done with 5 different random seeds, making a total of 100 inpaintings for each model and each texture. The quality of the inpainting was evaluated using the mean structural similarity index (MSSIM) (Wang et al., 2004), which compares the inpainted region with the ground truth. Fig. 3 compares the texture results of some of the methods under consideration. Table 1 provides the quantitative MSSIM comparison against other similar models. Here again, the TssRBM-based models are fairly competitive with these other probabilistic models of texture.
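The inpainting protocol amounts to clamping the border after every Gibbs sweep. The sketch below shows the frame construction and the clamping loop under stated assumptions: `gibbs_step` is a hypothetical callable that takes the current image and returns the model's next visible sample, and the helper names are ours.

```python
import numpy as np

def make_inpainting_frame(patch, hole=54):
    """Zero out the central hole x hole square of a square patch.
    Returns the frame and a boolean mask that is True inside the hole."""
    size = patch.shape[0]
    margin = (size - hole) // 2                  # 11 pixels for 76 and 54
    mask = np.zeros(patch.shape, dtype=bool)
    mask[margin:margin + hole, margin:margin + hole] = True
    frame = patch.astype(float).copy()
    frame[mask] = 0.0
    return frame, mask

def inpaint(frame, mask, gibbs_step, n_iter=500):
    """Run n_iter sweeps, resetting the border to its known values each time."""
    v = frame.copy()
    for _ in range(n_iter):
        v_new = gibbs_step(v)
        v = np.where(mask, v_new, frame)         # keep the border fixed
    return v
```

Only the pixels under the mask are ever overwritten, so the known 11-pixel border conditions every sweep of the chain.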

5.2 Experiments II: Exploring 
High-Resolution Textures 

To further explore the generative power of the DBN models, we move to a more challenging task, namely modeling high-resolution textures while keeping the first-layer structure unchanged: the same number of filters, the same receptive field size (11 x 11) and the same training patch size (98 x 98). This implies that the first layer will face a much more challenging learning task. We show that by adding more hidden layers these difficult tasks are handled very well. We add two more hidden layers on top of the first layer, which gives us three-layer DBNs. There are 128 filters with convolutional weight sharing in both of these two layers. Due to limited space, we only show the results for textures D53, D4 and D103. The other 5 textures yield a similar pattern of results. While the quantitative measures used in the previous experiments are useful to establish an objective comparison between methods, we feel that they are rather imperfect measures of the quality of the texture model, and therefore in this section we forgo these measures in favour of simply presenting texture synthesis results for visual inspection. Fig. 4 (right) illustrates the impact of both depth and the inductive bias (FPCD versus CD training) in training TssRBM-based models of texture.



Depth helps mixing. One key aspect that might help to explain the improvements in the models is that as the model gets deeper the mixing rate of the negative-phase Gibbs chain improves, as already demonstrated and argued in Bengio et al. (2012). Improved mixing of the Gibbs chain improves the performance of training methods such as PCD that rely on it for the estimation of negative-phase statistics. It also helps the generation of the samples shown. To demonstrate the improvement in mixing with depth, we assess the mixing rate of three models (one-, two- and three-layer models) trained on D103 via the autocorrelation spectrum. After training, we run a Markov chain in each of the three models and plot the autocorrelation spectrum in Fig. 4 (left). As seen in the figure, mixing becomes very fast in the three-layer model, i.e., samples at some distance in the chain are less correlated with each other.

Figure 5: LEFT: Multi-texture samples generated by the TssRBM model. RIGHT: Multi-texture samples generated by our 3-layer DBN.
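One common way to quantify mixing from a stored chain is the normalized autocorrelation at increasing lags. The following NumPy sketch shows one such estimator; the exact definition of the "autocorrelation spectrum" used for the figure is not given in the text, so the pooling over pixels and the normalization here are assumptions.

```python
import numpy as np

def autocorrelation(chain, max_lag):
    """chain: array of shape (T, ...) holding T successive samples.
    Returns rho[0..max_lag], averaged over all dimensions of a sample."""
    x = chain.reshape(chain.shape[0], -1).astype(float)
    x = x - x.mean(axis=0)                       # center each dimension over time
    var = (x * x).mean()
    T = x.shape[0]
    return np.array([(x[:T - lag] * x[lag:]).mean() / var
                     for lag in range(max_lag + 1)])
```

A chain that mixes quickly shows values that drop towards zero within a few lags, whereas a slowly mixing chain stays close to one.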

CD pretraining vs. PCD and FPCD pretraining. We find that pretraining the lower layers with CD-1 results in better DBN texture models. More specifically, worse results were obtained with PCD pretraining and with FPCD pretraining, and substantially better results with CD, as can be seen, e.g., in Figure 4 (right). This is consistent with the claims made in Hinton (2012) regarding the advantage of CD over PCD. It is also consistent with the results in Le Roux and Bengio (2008), which show that maximum likelihood training of the lower layers of a DBN is sub-optimal, and that, assuming a high-capacity top layer, the optimal way to train the first layer would be to minimize the KL divergence between the visible units and the stochastic one-step reconstruction, something much closer to what CD does than to what PCD does. Another hypothesis is that CD helps here because it makes sure to extract good features that preserve the input information, without the constraint that the lower-level RBMs do a good job (of avoiding spurious modes) far from the training samples. For the top-level RBM, by contrast, which is used to sample from the model, it is important to use a good approximation to maximum likelihood training.

5.3 Learning with Multiple Textures 

In this section, we try to assess the power of our deep 
models by using not only high-resolution texture im- 
ages but also multiple heterogeneous textures. We 
train a three-layer model on all 8 textures. The first
layer of our DBN is a TssRBM with 96 filters. The
second layer is our new RBM variant introduced in
Sec. 4 with 256 filters and receptive fields of size 2x2.
The third layer is a convolutional binary RBM with
256 filters and receptive fields of size 2x2. We compare
our DBN with a one-layer model (TssRBM with
128 filters). After training, we generate samples from
both models and show the results in Fig. 5. We can
see that the single-layer TssRBM only models the
high-frequency structure in the training data. On the other
hand, the deep model seems to capture much of the 
8 textures that occur in the training set. There are 
7 different textures apparent in these 32 samples. We 
are only missing samples of D4, which is a stochastic 
texture and hard to capture, particularly when most of
the training data are highly structured images. Kivinen
and Williams (2012) also trained a Gaussian RBM
with tiled-convolution weight sharing on multiple textures
with labels. The labels can help their model
pick different filters for different textures and thus
make the learning problem much easier.

6 Conclusions 

In this paper, we apply the ssRBM with tiled-convolution
weight sharing to the texture modeling task.
We show that not only is the ssRBM competitive as
a single-layer model of texture, but that, by being
amenable to CD training, it is well suited to being
incorporated into even more effective deep models of texture.
Interestingly, we find that CD training of lower
layers yields better models, and that mixing is better
in deeper layers. Our integration of the ssRBM into
a DBN necessitated the development of a novel RBM
with a spike-and-slab visible layer and a binary latent
layer. Finally, we show that our new ssRBM-based DBN is
capable of modeling multiple high-resolution textures.

References 

Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2012).
Better mixing via deep representations. Technical Report
arXiv:1207.4404, Université de Montréal.

Courville, A., Bergstra, J., and Bengio, Y. (2011a). A
Spike and Slab Restricted Boltzmann Machine. In
AISTATS'2011.

Courville, A., Bergstra, J., and Bengio, Y. (2011b). Unsu- 
pervised models of images by spike-and-slab RBMs. In 
ICML'2011. 



Desjardins, G. and Bengio, Y. (2008). Empirical evaluation
of convolutional RBMs for vision. Technical Report
1327, Dept. IRO, U. Montréal.

Gregor, K. and LeCun, Y. (2010). Emergence of complex-like
cells in a temporal product network with local
receptive fields. Technical report, arXiv:1006.0448.

Heess, N., Williams, C. K. I., and Hinton, G. E. (2009).
Learning generative texture models with extended
fields-of-experts. In BMVC.

Hinton, G. E. (2012). Tutorial on deep learning. IPAM 
Graduate Summer School: Deep Learning, Feature 
Learning. 

Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast 
learning algorithm for deep belief nets. Neural Compu- 
tation, 18, 1527-1554. 

Kivinen, J. J. and Williams, C. K. I. (2012). Multiple 
texture Boltzmann machines. In Proceedings of the Fif- 
teenth International Conference on Artificial Intelligence 
and Statistics (AISTATS'2012), volume 22 of JMLR: 
W&CP. 

Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W.,
and Ng, A. (2010). Tiled convolutional neural networks.
In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor,
R. Zemel, and A. Culotta, editors, Advances in Neural
Information Processing Systems 23 (NIPS'10), pages
1279-1287.

Le Roux, N. and Bengio, Y. (2008). Representational 
power of restricted Boltzmann machines and deep be- 
lief networks. Neural Computation, 20(6), 1631-1649. 

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998).
Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11), 2278-2324.

Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). 
Convolutional deep belief networks for scalable unsuper- 
vised learning of hierarchical representations. In L. Bot- 
tou and M. Littman, editors, ICML 2009. ACM, Mon- 
treal (Qc), Canada. 

Lin, W.-C., Hays, J. H., Wu, C., Kwatra, V., and Liu, Y.
(2006). Quantitative evaluation on near regular texture
synthesis. In Computer Vision and Pattern Recognition
Conference (CVPR'06), volume 1, pages 427-434.

Neal, R. M. (1994). Bayesian Learning for Neural Networks.
Ph.D. thesis, Dept. of Computer Science,
University of Toronto.

Ranzato, M. and Hinton, G. E. (2010). Modeling
pixel means and covariances using factorized third-order
Boltzmann machines. In Proceedings of the Computer
Vision and Pattern Recognition Conference (CVPR'10),
pages 2551-2558. IEEE Press.

Ranzato, M., Mnih, V., and Hinton, G. (2010a). Generating
more realistic images using gated MRF's. In
J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel,
and A. Culotta, editors, Advances in Neural Information
Processing Systems 23 (NIPS'10), pages 2002-2010.

Ranzato, M., Mnih, V., and Hinton, G. (2010b). Generating
more realistic images using gated MRF's. In
NIPS'2010.

Ranzato, M., Susskind, J., Mnih, V., and Hinton, G. E.
(2011). On deep generative models with applications to
recognition. In CVPR'11, pages 2857-2864.



Tieleman, T. (2008). Training restricted Boltzmann ma- 
chines using approximations to the likelihood gradient. 
In W. W. Cohen, A. McCallum, and S. T. Roweis, edi- 
tors, ICML 2008, pages 1064-1071. ACM. 

Tieleman, T. and Hinton, G. (2009). Using fast weights to
improve persistent contrastive divergence. In L. Bottou
and M. Littman, editors, ICML 2009, pages 1033-1040.
ACM.

Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P.
(2004). Image quality assessment: From error visibility
to structural similarity. IEEE Transactions on
Image Processing, 13(4), 600-612.

Wei, L.-Y., Lefebvre, S., Kwatra, V., and Turk, G. (2009).
State of the art in example-based texture synthesis.
Eurographics'09 State of the Art Reports.

Welling, M., Hinton, G. E., and Osindero, S. (2003). Learning
sparse topographic representations with products of
Student-t distributions. In NIPS'2002.

Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2005).
Exponential family harmoniums with an application to
information retrieval. In NIPS'04, volume 17, Cambridge,
MA. MIT Press.

Zhu, S. C., Liu, X. W., and Wu, Y. N. (2000). Exploring
texture ensembles by efficient Markov chain Monte
Carlo - towards a "trichromacy" theory of texture. IEEE
Trans. PAMI, 22(6), 554-569.



